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Abstract 

Salvia miltiorrhiza is one of the most widely used medicinal plants. As a first step to develop a chloroplast-based genetic 
engineering method for the over-production of active components from S. miltiorrhiza, we have analyzed the genome, 
transcriptome, and base modifications of the S. miltiorrhiza chloroplast. Total genomic DNA and RNA were extracted from 
fresh leaves and then subjected to strand-specific RNA-Seq and Single-Molecule Real-Time (SMRT) sequencing analyses. 
Mapping the RNA-Seq reads to the genome assembly allowed us to determine the relative expression levels of 80 protein- 
coding genes. In addition, we identified 19 polycistronic transcription units and 136 putative antisense and intergenic 
noncoding RNA (ncRNA) genes. Comparison of the abundance of protein-coding transcripts (cRNA) with and without 
overlapping antisense ncRNAs (asRNA) suggest that the presence of asRNA is associated with increased cRNA abundance 
(p<0.05). Using the SMRT Portal software {vl.3.2), 2687 potential DNA modification sites and two potential DNA 
modification motifs were predicted. The two motifs include a TATA box-like motif (CPGDMMl, "TATANNNATNA"), and an 
unknown motif (CPGDMM2 "WNYANTGAW"). Specifically, 35 of the 97 CPGDMMl motifs (36.1%) and 91 of the 369 
CPGDMM2 motifs (24.7%) were found to be significantly modified (p<0.01). Analysis of genes downstream of the 
CPGDMMl motif revealed the significantly increased abundance of ncRNA genes that are less than 400 bp away from the 
significantly modified CPGDMMl motif {p<0.01). Taking together, the present study revealed a complex interplay among 
DNA modifications, ncRNA and cRNA expression in chloroplast genome. 
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Introduction 

Salvia miltiorrhiza, also known as Chinese sage, tan shen, or 
danshen, is highly valued in traditional Chinese medicine for its 
roots or rhizomes [1]. S. miltiorrhiza extracts are the main 
components of two of the ten best-selling traditional Chinese 
medical products, each with a total sale of around 1 00 million US 
dollars. To date, more than 70 active components have been 
isolated from S. miltiorrhiza [2,3]. These active ingredients can be 
classified into two groups: lipid-soluble tanshinones and water- 
soluble phenolic acids. Compounds from both groups exhibit 
antioxidant, antitumor, anti-inflammatory, and antimicrobial 
properties [1]. 

Given the pharmacological and economic importance of the 
active components of S. miltiorrhiza, their overproduction remains 
an active research area. Recent studies have shown that 
chloroplasts are ideal hosts for the expression of genes related to 
tlie production of secondary metabolites [4] because these 
organelles support (1) the accumulation of large amounts of 



protein; (2) transgene stacking because of the presence of 
polycistron; (3) proper protein folding; (4) site-specific integration; 
and (5) maternal inheritance, which prevents the dissemination of 
transgenes through pollen. However, a detailed understanding of 
the chloroplast genome structure, gene contents, and gene 
expression regulation is a prerequisite for the development of 
effective chloroplast-based transgenic systems. 

Chloroplasts are plant-specific organelles for photo.synthesis, 
starch storage, nitrogen metabolism, sulfate reduction, fatty acid 
synthesis, DNA and RNA .synthesis, and the synthesis of 50% of 
soluble protein in leaves [5]. By March 2014, 442 complete 
Viridiplantae chloroplast genomes are deposited in GenBank. 
Comprehensive reviews have been conducted on the (1) structure 
and biology of chloroplast genomes [6] ; (2) RNA metabolism and 
translation regulation [7]; and (3) the coordination of gene 
expression between organeUar and nuclear genomes [8] . Although 
the S. miltiorrhiza chloroplast genome has been recently assembled 
and analyzed [9], the expression profile, as well as the presence 
and functions of noncoding RNA (ncRNA), and DNA modifica- 
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tions have not been studied in depth. Thus, the current study aims 
to clarify these issues. 

ncRNAs are RNAs that do not encode proteins, but have 
multiple functions. ncRNAs can be clas.sified into two broad 
categories: house-keeping and regulatory. House-keeping ncRNA, 
which include ribosomal, smaU nuclear, small nucleolar, and 
transfer RNA (tRNA), is constitutively expressed [10]. On the 
other hand, regulatory ncRNA are often expressed during specific 
developmental stages or under particular nutritional or environ- 
mental conditions [11]. One ncRNA subset, called natural 
antisense transcripts (NATs, also called asRNA), are endogenous 
RNA molecules with regions complementary to a sense transcript 
(cRNA). Through base-pairing with their targets, these asRNA 
elicit translational inactivation/activation, mRNA stabilization/ 
destabUization, or differential transcription termination [12,13]. 

Previous studies have identified dozens of ncRNA in chloro- 
plasts [10,14]. A number of studies have elucidated the regulatory 
roles of chloroplast ncRNA. For the first example, an ndhS asRNA 
covers two editing sites of the ndhR gene and a group II intron 
splice acceptor site, and might play important role in RNA 
maturation or stability [14]. For the second example, two different 
antisense RNAs of j^.s^Tgene were found to form double-stranded 
RNA/RNA hybrids, which results in translational inactivation of 
the psbT mRNA.. The hybrid was further hypothesized to provide 
protection against nucleolytic degradation of mRNA during 
photo-oxydative stress conditions [15]. For the third example, 
antisense RNA oirbcL and atpQ genes form stable double-stranded 
molecules with sense strand transcripts and prcv(-nts their 
translation [16]. For the forth example, the expression of 
complementary RNA of RpoB-2 from chloroplast transgenes 
affects editing efficiency of transgene and other endogenous 
chloroplast transcripts [17]. For the last example, overexpression 
of ASS, a chloroplast-encoded asRNA complementary to the 5S 
rRNA in tobacco, destabilizes 5S rRNA and retards plant growth 
[18]. These findings suggest that the ncRNA in the- chloroplast 
genome has various regulatory roles. Therefore, identifying and 
characterizing the ncRNA in the S. miltiorrhiza chloroplast genome 
would help reveal the functions of this important group of gene 
expression regulators. 

DNA contains a large variety of functionally important 
modifications [19]. The most common modification, DNA 
methylation, involves the addition of a methyl group to cytosine 
or adenine DNA nucleotides by methyl transferase (MTase). The 
most common methylation types are N6-methyladenine (m6A), 
N4-methylcytosine (m4C), and 5-methylcytosine (m5C) [20]. In 
plants, DNA methylation is essential for growth and development, 
and it affects gene expression, genomic imprinting, heterochro- 
matin assembly, and protection of the genome against migrating 
transposable elements [21,22]. It should be pointed out that most 
studies on the effect of DNA modifications on gene expression 
focus on the nu[:lear DNAs. For chloroplast DNAs, Ahlert et al. 
[23] observed the insensiti\'ity of plastic! gene expression to both 
adenine and cytosine methylation of the plastid DNA by 
expressing two cyanobacterial DNA methyltransferase genes from 
the tobacco plastid genome. However, whether or not these 
observations reflect the natural stages of DNA modifications in 
chloroplast genomes is unknown. As a result, identification of 
potential DNA modification sites in the S. miltiorrhiza chloroplast 
genome would provide valuable information on the epigenetic 
regulation of its gene expression. 

In this study, we used the Single-Molecule Real-Time (SMRT) 
Sequencing and strand-specific RNA-Seq technologies to charac- 
terize in detail the S. miltiorrhiza chloroplast genome, transcrip- 
tome, and DNA modifications. Detailed analyses of DNA 



modifications as well as of ncRNA and cRNA expression suggest 
a complex interplay between these processes. The results of this 
study would provide valuable information for the construction of 
chloroplast-based transgenic systems, as well as lay the foundation 
for research on ncRNA and DNA modifications in regulating the 
expression of coding genes in chloroplasts from other organisms. 

Materials and Methods 

Materials and DNA sequencing 

Fresh leaves from three plants of S. miltiorrhiza Bunge cv 99-3 
were collected from the Institute of Medicinal Plant Development, 
Beijing, China. The leaves are pooled together and total DNA was 
extracted from the leaves using a plant genomic DNA kit 
(Tiangen, China). The genomic DNA of S. miltiorrhiza were 
isolated and subjected to DNA sequencing using SMRT 
sequencing, a parallelized single molecule DNA sequencing by 
synthesis technology developed by PacBio (Pacific Biosciences, 
Menlo Park, California, United States of America). The DNA 
sequencing is done on a SMRTceU containing hundreds thousand 
of zero-mode waveguide (ZMW) nano-pores. Inside each ZMW, a 
single active DNA polymerase with a single molecule of single 
stranded DNA template is immobilized to the bottom through 
which light can penetrate and create a visualization chamber. The 
nucleotide substrate is phosphor-linked to a fluorescence dye. As 
the DNA synthesis proceeds, the fluorescence signal is detected 
when a nucleotide is held by the DNA polymerase before the 
nucleotide being incorporated into the DNA strand, resulting in 
the DNA sequencing in real time. We choose SMRT technology 
for this study as it is the only technolog)' that is currentiy capable 
to directiy detect potential modifications on native DNA without 
PCR amplification [24], The DNA was used to construct two 
libraries with the insert sizes of 1 kb and 10 kb, which were 
subjected to DNA sequencing following the standard protocol 
provided by the manufacturer (PacBio). 

Genome Assembly 

To assemble the genome, we downloaded 277 chloroplast 
genomes from GenBank on October 2012. These sequences were 
used to search against PacBio reads for S. miltiorrhiza. Similar reads 
with an E value < le-5 were extracted and subjected to sequence 
assembly using Si;qman (version 9.0). The genome sequence of 
Sesamum indicum, which has the highest overall sequence similarity 
with S. miltiorrhiza reads, was used as a reference to order the 
contigs. The gaps were filled by iteratively searching the PacBio S. 
miltiorrhiza reads using the contigs, extracting reads that share 
sequence similarities, extending the ends, adding the reads to the 
assembly, and then performing reassembly. After obtaining an 
initial assembly, we designed primers (Table SI in File SI) to 
amplify the four junctions between the single-copy segments and 
the inverted repeats. We sequenced the PCR products using a 
BigDyeV3.1 Terminator Kit for 3730XL (Life Technologies) and 
assembled high-quality Sanger sequences into the initial assembly 
to obtain th(" final assembly using Seqman (DNASTAR, WI). The 
final assembly has been deposited into EMBL (accession number 
HF586694). Note that another chloroplast genome oiS. miltiorrhiza 
was published during the duration of this study [9]. After 
comparison, only 41 base diflerences were found between the 
two genome assemblies; the general features are otherwise 
identical (results not shown). 

RNA extraction and RNA-Seq analysis 

Total RNA was extracted from fresh leaves of three S. 
miltiorrhiza plants using an RNAprep pure plant kit (Tiangen, 
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China) separately following the manufacturer's recommendation. 
The three RNA samples were pooled and the cDNA was amplified 
using a random-priming protocol. Ribosomal RNA was removed 
from total RNA using MICROBi?x/)re,« Bacterial mRNA Enrich- 
ment Kit (Ambion, AM 1905). A RNA-Seq library was constructed 
following a strand-specific protocol (dUTP, Dlumina) [25]. The 
resulting cDNA were cleaved into small fragments (300 bp to 
400 bp) to construct sequencing libraries, followed by emulsion 
PGR and then sequenced according to the manufacturer's 
protocols. The resulting data can be found in GenBank with 
acces.sion number SRA1045051. All reads were mapped to the 
chloroplast genome assembly using Tophat (vl.3.2) [26]. The 
polycistrons and ncRNA were identified manually based on the 
identification of contigs assembled from overlapping RNA-Seq 
reads along the S. miltiorrhiza genome assembly. The 5' and 3' ends 
were determined based on the very ends of each contig. The 
RNA-Seq data were visuahzed and quantified using SeqMonk 
(vO.23.0, Babraham Bioinformatics, UK) and Tablet (vl. 12. 12.05) 
[27]. SeqMonk is able to call the abundances/expression levels for 
given regions and output the abundance as number of Reads Per 
Million (RPM). Gonsidering the low expression levels of ncRNA, 
no coverage cutoff was set. In another word, the minimal coverage 
is set as 1 for the RNA-Seq data analysis. 

Strand-specific real-time quantitative PCR 

Total RNA from leaves of three plants were extracted as 
described above. Strand-specific real-time quantitative PGR (ss- 
qPCR) was performed as described previously [28]. Briefly, 
twenty-four pairs of antisense ncRNA and cRNA were randomly 
selected, and ss-qPCR primers were designed (Table S2 in File SI). 
A sequence similarity search was conducted to ensure that the 
primer sequences would uniquely amplify the expected regions. 
Strand specific cDNA templates were synthesized with tagged 
primers to add 18 to 20 nucleotide tags unrelated to S. miltionhiza 
(Table S2 in File SI: cRNAtag; GGTAGGTTGAGGTAGG- 
GATC and mRNAtag; CGAGATCGTTGGAGTGGT) at the 5' 
end (Table S2 in File SI) [28]. Reverse transcription with the 
tagged primer was performed using a PrimeScript RT-PCR Kit 
(Takara). The PGR was performed as follows: a 10 ^l mixture 
containing approximately 500 ng of total RNA, 1 |il of dNTP 
mixture (10 mM each), and 2 pmol of tagged primer was heated 
for 5 min at 65°G and then cooled to 4°G. The reaction mixture 
contained 4 ^1 of PrimeScript buffer (5 x), 0.5 \l\ of RNase 
inhibitor (40 U/nl), 0.5 \l\ of PrimeScript RTase, and 5 |xl of 
RNase-free water. The RT reaction was conducted at 45 °G for 
30 min and terminated by heating at 70°G for 15 min. The 0.5 \A 
reverse-transcription product was amplified with an SYBR Premix 
Ex Taq kit (Takara) according to manufacturer's instructions. The 
qPGR experiment was performed on an ABI 7500 Fast instrument 
(Applied Biosystems). The cycle conditions were as follows: 95°G 
for 10 min, followed by 40 cycles of 95°G for 5 s, and 60°G for 
34 s. The primers used are listed in Table S2 in File SI. Melting 
curve acquisition and analysis were performed immediately after 
amplification using default settings, which are 95°C for 15 sec, 
followed by 60°G for 1 min, and 95°G for 15 sec. 

Validation of RNA-Seq results with ss-qPCR 

To validate the strand specificit)' and expression abundance of 
ncRNA detected in our RNA-Seq data, three total RNA samples 
extracted from leaves of three S. miltiorrhiza plants were analyzed 
using ss-qPGR as described above. The expression quantification 
by qPGR was performed as described previously [29] and a cRNA 
coding for atpQ gene was chosen as the internal control for each ss- 
qPGR experiment. 



For RNA-Seq data, we first specified the start and end of the 
overlapping regions for each pair of ncRNA and cRNA. SeqMonk 
was used to calculate the RPMs for ncRNA and cRNA for the 

overlapping regions. Validation of differential gene expression 
between ncRNA and cRNA was performed as previously 
described [30,31]. Briely, the log fold change, or log ratio, of 
ncRNA and cRNA was calculated as log[(RPM for ncRNA)/ 
(RPM for cRNA)]. For ss-qPGR data, the log ratio for each 
technical replication was calculated as [(40-ct) for ncRNA] - [(40- 
ct) for cRNA] . The log ratios of the three technical replications 
were averaged to create the mean and standard deviation of log 
ratio for each biological replicate. The statistical significance 
between the RNA-Seq and ss-qPGR results was tested using one- 
sample t test. 

Detection of DNA modifications and identification of 
DNA modification motifs 

The total DNA was sequenced using SMRT technology 

following the standard procedures as described above. Sequence 
preprocessing and DNA modification detection were performed 
using the SMRT Analysis pipeline (version 1.3.2). The DNA 
modification motifs were identified using MotifFinder (PacBio). 

Statistical Analysis 

Statistical analyses, including one-sample t test and analysis of 
variance, were conducted using the JMP software (v9.0, SAS, NG) 
and customer scripts written with Perl module Statistics::Distribu- 

tions (vl.02). 

Results 

Integrative analyses of the S. miltiorrhiza chloroplast 

We used strand-specific RNA-Seq, and SMRT technologies 
(Figure SI in File SI) to characterize the genome, transcriptome, 
and DNA modifications of the S. miltiorrhiza chloroplast. An 

ideogram summarizing our results is shown in Figure 1, which 
include the following information: the putative DNA modification 
sites on the positive strands (a) and negative strands (b); the 
expression level of noncoding genes on the positive strands (c) and 
n(-gati\ e strand (d); the expression level of coding genes on the 
positive strands (e) and negative strand (f); and the identified 
polycistrons on the positive strands (g) and negative strands (h). 
This ideogram provides an overall picture of the transcriptome 
and DNA modifications in the S. miltiorrhiza chloroplast, which are 
described in detail below. 

Analysis of chloroplast gene expression in leaves using 
RNA-Seq technology 

We conducted strand-specific RNA-Seq analysis of the S. 

miltiorrhiza transcriptome to understand the gene expression 
landscape of the S. miltiorrhiza genome. Chloroplast genes are of 
prokaryotic origin. Therefore, we used a protocol designed to 
analyze prokaryotic transcriptome. We generated up to 
53,849,024 read pairs. Using Tophat [26], 553,014 (1.0%) of tiie 
reads were mapped to the S. miltiorrhiza chloroplast genome. In a 
previous study, 133 genes have been identified in the S. miltiorrhiza 
chloroplast genome [9], including 131 predicted functional genes 
and 2 pseudogenes. The functional genes consist of 114 unique 
genes, including 80 protein-coding genes, 30 tRNA genes, and 4 
rRNA molecules. For genes with more than one copy, we selected 
only one for further analysis. Except for 14 unique genes, all genes 
were expressed, including 80 protein-coding genes, 4 rRNA genes, 
and 16 tRNA genes (Table 1). Fourteen tRNA were not found 
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Figure 1. Ideogram showing the transcriptome and DNA modifications of the Salvia miltiorrhiza chloroplast genome. The predicted 
genes are drawn on the outernnost circle. Those drawn outside the circle are transcribed in a clocl<wise direction, whereas those drawn inside the 
circle are transcribed in a counterclocl<wise direction. The distributions of DNA modification motif 1 on the positive strand (a) and the negative strand 
(b) are shown as bars. The red bar indicates motifs that are significantly modified. The expression levels of the noncoding genes on the positive (c) 
and negative strands (d); the coding genes on the positive (e) and negative strands (f) are shown in black and white arcs on grey background, The 
polycistrons on the positive (g) and negative strands (h) are shown as green and yellow arcs on white background. The heights of the arcs in (c) to (h) 
represent the expression levels of the corresponding transcripts. 
doi:10.1371/journal.pone.0099314.g001 



expressed. One possibility is that the tRNA transcript lengths 
range from 70 bp to 92 bp. These transcripts were excluded from 
our Ubrary because only fragments that are 300 bp to 400 bp long 
were selected to construct the library. By contrast, 16 tRNA genes 
were expressed, 9 of which were embedded in polycistrons 
(Table 1) and 3 have large introns. The remaining four tRNA 



coding genes (frnF-GAA, taN-GUU, frnS-GGA, and frnQ.-UUG) 
were embedded in long noncoding transcripts (see below). 

These genes were quantified using the Seqmonk software. The 
results are shown in Table 1. The most highly expressed genes 
were the rRNA 23S and rRNA 16S, as well as pshP^, pshB, psbC, 
psbD, and rbch. This result is consistent with the involvement of 
rRNA in protein synthesis as well as with the participation of psb 
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A Intergenic ncRNAs 




A3 ^mlX^ < ncRNA 

A4 Bilateral ncRNAs 




B Antisense ncRNAs 
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B3B 
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Figure 2. Schematic representations of various types of ncRNA 
identified in this study. Protein-coding genes and their exons are 
presented in blacl<, whereas ncRNA are presented in grey. The arrows 
indicate the direction of the transcripts. (A) Intergenic ncRNA, ncRNA 
expressed in the intergenic region. Four different types, namely, A1, A2, 
and A3, A4 are shown. (B) Antisense ncRNA, ncRNA transcribed from the 
antisense strand of protein-coding genes. This ncRNA overlaps with 
either the 5' (B1 ) end or 3' (B2) end of the coding region, is embeded in 
(B3A) or spans (B3B) one entire protein-coding gene, or spans two 
protein-coding genes (B4). Intronic asRNA (B5) is transcribed entirely 
from an intron of a coding gene. 
doi:10.1371/journal.pone.0099314.g002 

genes and rbch in photosynthesis. Interestingly, two tRNA genes, 
namely, taA-UGC and frnl-UUC, were highly expressed at levels 
comparable to those of rRNA genes and psh subunit genes. The 
taA-UGC and tal-GAU transcripts accounted for 60.60% and 
38.18% of all tRNA transcripts, respectively, even though alanine 
and isoleucine only account for 2/30 (6.7%) of the amino acids in 
the S. miltiorrhiza genome. Detailed examination showed that these 
two tRNA genes are located in polycistron pel 7, which contains 
the rRNA genes. Their expression levels are consistent with those 
of other genes in the same polycistron. Therefore, this region is a 
potential site for foreign gene insertion (see below). 

Polycistron identification 

Most genes from chloroplast genomes are transcribed in 
polycistrons [32]. We determined the polycistrons in the S. 
miltiorrhiza chloroplast genome based on the presence of a 
continuous transcript. In total, we found 19 polycistronic 
transcripts containing 71 genes, which include 58 coding-protein 
genes, 9 tRNA, and all 4 rRNA. The polycistron identity, its 
member genes, and the strand on which the gene is located are 
shown in Table 1 . By comparing with the results reported by Yang 
et al. [33], the polycistron (pel) was identified for the first time. 



The transcripts for a number of genes [atpl, rpoC2) were not 
entirely covered by the RNA-Seq reads. However, based on the 
results of a previous study [33], we consider these cases as 
examples of biased coverage of the sequencing technologies. 
Moreover, the corresponding broken transcripts (polycistron pc3) 
were treated as a continuous polycistronic transcript. 

ncRNA identification and classification 

A recent study found unexpected diversity in the ncRNA in the 
chloroplast transcriptome [10]. We performed a systematic 
analysis to identify ncRNA in the chloroplast genome of S. 
miltiorrhiza. To be consistent, we defined coding RNA (cRNA) as 
any RNA transcripts transcribed from genes encoding proteins. 
We define non-coding RNA (ncRNA) as any RNA transcripts 
transcribed that do not encode proteins and have a length > 
= 100 bp. We identified 136 ncRNA transcripts based on these 
criteria (Table 2). These ncRNAs are classified into two groups, 
namely, intergenic ncRNA (Figure 2A) and antisense ncRNA 
(asRNA, Figure 2B), based on their positions relative to the cRNA 
genes. Intergenic ncRNA are those located between two 
transcripts or polycistrons and its distance to the start or end 
position of each transcript was at least 100 bp (Figure 2A). Three 
types, namely, Al, A2, and A3, were found. A special case of 
intergenic ncRNA expresses from both strands and was defined as 
bilateral ncRNA (A4, Figure 2A). 

asRNA are those located on the antisense strand of a cRNA 
gene and overlaps with cRNA gene regions. Depending on their 
location relative to the corresponding coding region of the cRNA 
gene, the asRNA were further divided into five types: 5' (Bl), 3' 
(B2), coding (B3), span (B4) and intron (B5) asRNA (Figure 2B). Bl 
includes those asRNA transcripts that overlap with the region 
upstream of the coding region of cRNA gene but do not cover die 
entire coding region. B2 are those overlap with the region 
downstream of the coding region of cRNA but do not cover the 
entire coding region. B3 are those either overlap with part of the 
coding region (B3A), or cover the entire coding region (B3B). B4 
include those span two adjacent cRNA genes. B5 include those 
located within the intronic region of the cRNA gene. In total, 136 
ncRNA transcripts have been identified (Table 2). Their sizes 
range from 135 bp to 1296 bp, with a mean of 263 bp. They 
include 1 8 intergenic and 1 1 8 antisense ncRNA. The numbers of 
ncRNA transcripts belonging to various subtypes are: Al: 1; A2: 4; 
A3: 9; A4: 4; Bl: 9; B2: 9; B3A: 72; B3B: 6; B4: 11; B5: 11; 
respectively. 

To determine if these ncRNA are conserved across various 
plant chloroplast genomes, we compared these sequences with 
those from A. thaliana. A previous study has identified 107 ncRNA 
in A. thaliana chloroplast genome [10]. In total, 45 S. miltiorrhiza 
ncRNA were found similar to 39 A. thaliana ncRNA using 
BLASTN with an a cutoff of E value <le-5, (Table 3). Four types 
of mappings were identified, which are 1:1, l:n, n:l and n:n 
respectively. "1:1" means that one S. miltiorrhiza ncRNA was 
mapped to one and only one A. thaliana ncRNA. "l:n" means that 
one S. miltiorrhiza ncRNA was mapped to multiple A. thaliana 
ncRNA. "n:l" and "n;n" mean that multiple S. miltiorrhiza 
ncRNAs were mapped to one or multiple A. thaliana ncRNAs 
respectively. In particular, there are five n: 1 mappings between S. 
miltiorrhiza and A. thaliana ncRNA. Moreover, one S. miltiorrhiza 
ncRNA (nc99) was similar to two A. thaliana ncRNA [ndhS-5' -I, 
ndhF-5'-2), a l:n mapping. Last, nc92 and nc93 of S. miltiorrhiza are 
similar to trnL-ndhRmt and ndhB-3' oi A. thaliana, a n:n mapping. 
However, we can not exclude that possibilities that some of these 
n:l,l:n and n:n mappings resulted from partial transcripts. These 
conserved ncRNA between S. miltiorrhiza and A. thaliana may have 
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Table 3. List of ncRNAs from S. miltiorrhiza and A. thaliana sharing sequence similarity. 



5. miltiorrhiza 


A. tlialiana 




Identity (%) 


E value 


ncRNAs ID 


asRNA Name 


asRNA Location 






nc5 


frnQ 


6521_6646 


100 


3.00e-1 1 


nc6 


psfaK 


6882_7462 


88.7 


6.00e-22 


nc7 


fmS.1 


7817_7984 


97.5 


1 .OOe-1 5 


nc29 


trnG-3' 


9244_9484 


96.3 


7.00e-08 


nc8 


atp^ 


10342_10536 


84.29 


1 .OOe-09 


nc22 


trnD 


29627_29825 


91.89 


6.00e-09 


nc24 


psfaD 


33106_33456 


96.74 


5.00e-42 


nc25 


psbD 


33106_33456 


94.44 


2.00e-69 


nc30 


rpsl4-psaB\nt 


37250_37327 


93.75 


2.00e-08 


nc35 


Ycf3-3' 


42542_42629 


88.64 


3.00e-08 


nc41 


Rps4-5' 


45747_45929 


94.44 


7.00e-1 1 


nc67 


petA 


62045_62271 


89.43 


7.00e-58 


nc69 


psb] 


63479_63648 


90.62 


2.00e-51 


nc70 


psbL-psbF 


63805_64018 


93.59 


3.00e-49 


nc72 


orf31 


65708_65805 


90.36 


2.00e-24 


nc74 


Rps 18 


67784_68222 


95.36 


2.00e-1 09 


nc80 


psfcB 


73677_73837 


91.3 


8.00e-57 


nc81 


psbH 


74601 _74726 


96.83 


3.00e-27 


nc84 


rpoA 


78320_78768 


84.09 


2.00e-32 


nc85 


rpoA 


78320_78768 


86.02 


4.00e-43 


nc86 


Rp/36 


79502_79858 


94.29 


2.00e-10 


nc91 


trnlA 


86203_86389 


94.65 


2.00e-80 


ncl37 


YcflA-l 


87948_88212 


94.59 


8.00e-98 


ncl36 


YcflA-S 


88451_88564 


94.81 


1 .OOe-30 


ncl34 


yc/2.1-4 


88960_89328 


93.08 


1 .OOe-62 


nc131 


VcQ.I-S 


89445_90769 


94.94 


5.00e-79 


ncl32 


VcQ.I-S 


89445_90769 


95.95 


3.00e-105 


ncl33 


VcQ.I-S 


89445_90769 


87.12 


4.00e-33 


ncl34 


VcQ.I-S 


89445_90769 


89.02 


3.00e-20 


nc129 


yc/2.1-7 


91 703_92005 


90.73 


1 .OOe-73 


ncl27 


ycf2.1-8 


92309_92432 


95.83 


4.00e-18 


nc92 


trnL-ndhBint 


94377_94967 


86.43 


4.00e-90 


nc92 


trnL-ndhB\nt 


94377_94967 


89.25 


4.00e-25 


nc92 


trnL-ndhS'mt 


94377_94967 


93.55 


2.00e-1 8 


nc92 


ndhB-3' 


95111_95707 


95.24 


3.00e-44 


nc93 


ndhB-3' 


951 1 1_95707 


96.1 


1.00e-1 10 


nc95 


ndhB\ 


95910_96164 


94.9 


1 .OOe-67 


nc98 


ftps 12-5' 


98718_98822 


98.1 


1.00e-51 


ncl 14 


rrn4.5-rrn5\nt 


ia7905_1 07963 


97.83 


2.00e-19 


ncl 12 


Vcfl.l-I 


109354_1 09553 


97.96 


4.00e-21 


ncl 10 


ycn.1-2 


ia9923_1 10387 


94.62 


9.00e-38 


ncl 10 


ycfl.1-2 


109923_1 10387 


92 


2.00e-14 


nc99 


ndhF-5'-1 


112345_1 12435 


85.92 


1.00e-12 


nc99 


ndhF-5'-2 


112475_1 12665 


88.76 


2.00e-23 


ncl 03 


ndhD-2 


1 1 61 37_1 16437 


91.18 


2.00e-07 


ncl 05 


ndhM 


120852_1 20948 


86.67 


1.00e-06 


ncl 07 


ndhh-5' 


121398_121822 


92.19 


7.00e-47 


ncl 08 


ndhA-5' 


121398_121822 


87.72 


1.00e-11 
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Table 3. Cont. 



S. miltiorrhiza A. thaliana Identity (%) E value 

ncRNAsJD asRNA Name asRNA Location 

ndlO Ycf\. 2-7 128220_1 29004 94.62 9.00e-38 

ncllO Yen 2-7 128220_1 29004 92 2.00e-14 

nclll yen. 2-7 1 28220_1 29004 95.73 9.00e-110 



doi:10.1371/journal.pone.0099314.t003 

similar functions. Furtiier investigations are required to test this 
possibility. 

Validation of RNA abundance and differential expression 
from RNA-Seq data using ss-qPCR 

We first calculated the correlation between the RNA abundance 
obtained fi-om RNA-Seq and the three ss-qPCR results respec- 
tively. For these 24 pairs of ncRNA and cRNA, pearson 
correlation coefficients of 0.749, 0.764 and 0.779 were obtained 
(data not shown). 

The ncRNA and its corresponding cRNA can be considered as 
two different conditions for differential gene expression study. We 
compared the log fold changes, also called log ratios, of the 
expression levels for ncRNA and cRNA obtained from RNA-Seq 
and ss-qPCR experiments. The calculation of the log ratio of 
ncRNA/cRNA from RNA-Seq and each of the ss-qPCR 
experiments is described in the method section. The log ratios 
are shown in Figure 3. For a pair of ncRNA and cRNA, the 
statistical significance between the RNA-Seq and the ss-qPCR 
results were tested using one-sample ("test (Table S3 in File SI). A^ 
value <0.05 was considered significantly different. In total, 1 7 out 
of 24 pairs of ncRNA and cRNA (70%) did not exhibit 
significantiy different gene expression between RNA-Seq and ss- 
qPCR results (Figure 3). The 7 pairs showed significantiy different 
expression between the RNA-Seq and ss-qPCR results are 
indicated with "*". Overall, these findings demonstrate the 
reasonably good quality of our RNA-Seq data. 

Interplay between asRNA and cRNA expression 

To detect any interplay between cRNA,and asRNA, we 
determined the expression levels of cRNA and asRNA transcripts 
of all 80 protein-coding genes (Table 2). It should be pointed out 
that, in Table 1 , the start and end positions of cRNA are defined 
based on those of ncRNA. While in Table 1 , the start and end 
positions of cRNA were defined based on those from gene 
annotations. At a result, the expression levels are slightly different 
for the same cRNA in Table 1 and 2. The expression of antisense 
strands was not detected in 27 of the 80 protein coding genes. A 
Students' i-test showed that the abundance of cRNA with asRNA 
expression was significantly higher than those of cRNA with no 
ncRNA expression (/)<0.05) (Figure 4A). This result suggests that 
asRNA expression promotes cRNA expression on the opposite 
strand (see the Discussion section for additional details). As 
previously described, asRNA can be divided into four types based 
on their relative positions to the coding transcripts. To detect any 
correlation between the asRNA locations and cRNA expression, 
we plotted the asRNA expression by their types (Figure 4B). The 
average abundance of cRNA with 3 ' coding asRNA and intronic 
asRNA were similar to those shown in Figure 4A. However, the 
abundance of cRNA with 5' asRNA was considerably lower but 



the difference was not statistically significant, which indicates the 
possible repressive effects of asRNA on cRNA abundance. 

Identification of DNA modification sites and DNA 
modification motifs 

DNA modifications influence cellular functions. SMRT tech- 
nology no only produces considerably longer reads than other 
second-generation DNA sequencing technologies but also provides 
detailed kinetic information of the DNA polymerase. As described 
earlier, SMRT sequencing technology detects the fluorescent 
signal during the period when the DNA polymerase is "holding" a 
nucleotide. These form fluorescent "pulses" as DNA polymerase 
continue incorporating nucleotides into the elongating DNA 
strand. The interpulse duration (IPD) is the time difference 
between two "pulses". IPD ratio is calculated as [IPD at a site at a 
condition under study] /[IPD at the same site at the control 
condition] , The IPD changes significantly in the presence of DNA 
modifications. Based on the IPD ratio, the DNA modification 
status of the base can be inferred [20]. Using the DNA 
modification analysis module of SMRT Portal 1.3.2, we identified 
2687 (/;<0.01) putatively modified bases throughout the genome 
(Figure 5A). Among these, 90.3% IPD ratio ranged from -2 to 2. 
We identified two DNA motifs using the motif analysis module, 
namely, "TATANNNATNA" (Figure 5B) and "WNYANT- 
GAW" (Figure 5C), whose percentage of putatively modified sites 
was 35/97 (36.1%) and 91/369 (24.7%), respectively (Table 4). 
Detailed information on the sites belonging to motifs 1 and 2 is 
included in Tables S4 and S5 in File SI, respectively. 

Interplay between DNA modification and gene 
expression 

The sequence of DMMl, "TATANNNATNA," is simUar to 
that of a TATA box frequently found in the core promoter of 
prokaryotes. Therefore, whether the modifications of these 
putative TATA box motifs would affect the expression of genes 
downstream of the modification sites is of significant interest. We 
extracted the transcripts corresponding to genes coding for 
protein, tRNA and ncRNA downstream of these DMM 1 motifs. 
The downstream transcripts were further classified into three 
groups based on the distance between the modification motifs and 
the start of the downstream genes. For the group in which aU 
transcripts were ncRNAs and the distance between DMMl and 
the start of ncRNAs ranged from 100 bp to 500 bp (Table S6 in 
File SI), the presence of DNA modification was significantiy 
correlated with higher ncRNA expression levels. The difference 
was significant with j6<0.01 (Figure 6 and Table S7 in File SI). 

Prediction of sites for foreign gene insertion 

We systematically analyzed the regions to select those that are 
suitable insertion sites. Our criteria included the following: (1) the 
region should have minimal effects on the expression of 
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] RNA-Seq 
I s5-qPCR of plant 1 
i ss-qPCR of plant 2 

a ss~qPCR of plant 3 





-15 -10 -5 0 5 10 

Log[Abundance(ncRNA/cRNA)] 

Figure 3. Validation of RNA-Seq data using ss-qPCR. Specific 
primers were designed and used to measure tine abundance of the 
transcripts that correspond to the positive and negative strands using 
ss-qPCR. The log ratio (X axis) of the abundance of ncRNA and cRNA 
transcripts was used to measure the differential expression of this locus. 
(*) indicates those transcripts whose log ratio derived from the RNA-Seq 
result is significant different from those of ss-qPCR (p<0.05). 
doi:1 0.1 371/journal.pone.009931 4.g003 

endogenous chloroplast genes; and (2) it has potential to be highly 
expressed. In particular, we selected locations that are in the 
intergenic regions of polycistronic sites and at which the relative 
expression levels of genes are high. The top-ranking regions are 
presented in Figure 7. Interestingly, these regions include frnl- 
GAU and taA-UGC (Figure 7C), which have been widely used in 




Absence Presence 
of asRNA of asRNA 
Protein-coding Genes 




3' Coding Intronic 
Types of asRNAs 



Figure 4. Effects of asRNA on the expression of the corre- 
sponding cRNA. (A) log abundance of cRNA with and without asRNA. 
(B) log abundance of cRNA in the presence of four different types of 
asRNA. The line in the middle of the box represents the mean. The 
upper and lower borders of the boxes indicate the 75% and 25% 
quantile lines, respectively. The ends of the whiskers represent two 
standard deviations from the mean. "*" indicates that the difference is 
significant (p<0.05) based on a Student's f-test. 
doi:10.1371/journal.pone.0099314.g004 

chloroplast-based genetic engineering [34]. The consistency 
between our prediction and those already used experimentally 
indicates that our selection criteria are reasonably well. 

Discussion 

RNA-Seq technology 

The current study takes advantage of two advanced technolo- 
gies, RNA-Seq and SMRT, to identify asRNA and DNA 
modifications and then characterize their effects on gene 
expression in the S. miltiorrihiza chloroplast genome. RNA-Seq is 
a revolutionary tool for transcriptomic analysis [35] . It has several 
advantages over other transcriptomic methods such as tiling 
microarray and cDNA or EST sequencing [36], namely, 
independence from genomic sequence, low background noise, 
wide dynamic range to quantify gene expression level, and high 
throughput, among others. Moreover, strand-specific RNA-Seq 
methods provide information on the orientation of transcripts. 
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Figure 5. Identification of putative DNA modification sites and 
motifs in the 5. miltiorrhiza chloroplast genome using SMRT 
technology. (A) Distribution of interpulse duration (IPD) Ratio across 
the genome. "A" indicate that the corresponding IPD ratio ("A") has a 
p<0.05, which suggests that the base at this position is modified. (B) 
Sequence LOGO for putative DNA modification motif 1 (DMMI). (C) 
Sequence LOGO for putative DNA modification motif 2 (DIV1M2). 
doi:1 0.1 371 /journal.pone.009931 4.g005 

which is valuable for transcriptome annotation particularly for 
regions with overlapping transcription from opposite directions 
[25]. As demonstrated in the current study, strand-specific RNA- 
Seq technology enables the detection of polycistrons and asRNA 
from chloroplast transcriptome, which is essential for interpreting 
the functional elements of chloroplast genomes. 

SMRT technology 

SMRT is a novel technology for detecting DNA methylation 
while simultaneously determining the context of the corresponding 
DNA sequence [20]. Before the development of SMRT technol- 
ogy, the most commonly used methods for detecting DNA 
modification are largely limited to bulk methods such as 
chromatography, mass spectroscopy, electrochemistry, radioactive 
labeling, immunochemical assays, and sensitivity to restriction 
enzymes [20]. However, these approaches suffer from low 
resolution and an inability to identify the precise sequence context 
of the methylation site(s). The SMRT technology allows the 
collection of kinetic data for the enzyme during the incorporation 
of each dNTP into the DNA strand. Significant changes in kinetic 
parameters such as IPD ratio should be observed when the DNA 
polymerase encounters m6A, m5C, or 5-hmC on the template 
strand. These distinct kinetic signatures allow the identification of 
the type and position of the base modification in the DNA 
template [20]. The SMRT technology is likely to enhance DNA 
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Unmodified Modified 
CPGDMMI 

Figure 6. Effects of modification at the CPGDMM1 motif on the 
abundance of ncRNA. Box plot showing the differences in the 
expression of two groups of ncRNA transcripts (x axis), which are 
defined based on whether or not their upstream CPGDIV1M1 motifs are 
significantly modified. The y axis shows the arbitrary expression levels of 
ncRNA transcripts in these two groups. The line in the middle of the box 
represents the mean. The upper and lower borders of the boxes 
indicate the 75% and 25% quantile lines, respectively. The ends of the 
whiskers represent two standard deviations from the mean. "*" indicate 
that the difference is statistically significant using ANOVA (p<0.01). 
doi:10.1371/journal.pone.0099314.g006 

modification studies on samples that were not previously accessible 
for this type of research, such as organelle genomes. 

Gene expression regulation by ncRNA 

Several mechanisms have been proposed to explain the 
regulatory role of ncRNA, including transcription itself (transcrip- 
tional interference), as well as the formation of duplexes of RNA- 
DNA (chromatin remodeling) and RNA-RNA (dsRNA-depen- 
dent RNA formation) [37]. Under the first model, the abundance 
of natural sense-antisense transcript (SAT) pairs was found to be 
inversely correlated and antisense transcription leads to the 
downregulation of the corresponding sense transcript. Under the 
second model, the antisense transcript recruits histone-modifying 
enzymes that alter the chromatin structure, resulting in sense- 
transcript repression. Consistently, sense transcription is induced 
in the absence of antisense transcripts. Under the last model, 
double-stranded RNA formation spans an intron, which inhibits 
splicing and leads to a mature transcript with a retained intron. 

Although the majority of studies on asRNA focused on those 
from nuclear genomes, some studies focused on asRNA in the 
chloroplast genome. For example, overexpression of a natural 
chloroplast-encoded antisense RNA (ASS) in tobacco destabilizes 
5S rRNA and retards plant growth, possibly because ASS prevents 
the accumulation of misprocessed 5S rRNA and controls its 
stoichiometry [18]. The majority of previous studies showed that 
asRNA expression is negatively correlated with that of the cRNA. 
However, positive correlation has also been found between asRNA 
and cRNA in studies on human breast epithelium and mammals 
[38,39]. 

In our study, we showed that at the genome scale, the cRNA 
expression levels in SATs are significantly higher than those not in 
SAT (p<0.05) (Figure 4A). In other words, cRNA expression is 
positively correlated with asRNA expression. One likely mecha- 
nism is that asRNA expression might open up the chromatin and 
increase the accessibility of the corresponding region to the 
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Figure 7. Schematic representation of the optimal insertion 
positions for foreign genes identified in the current study. The 

names, genomic positions, and orientations of the genes in three 
different loci, namely, rps^4-psaB-psah, ndhJ-ndhK-ndhC, and rrn^6S- 
(fml-GAU)-(frnA-UGC)-rm23S, are shown in panels (A), (B), and (C), 
respectively. P: positive strand; N: negative strand. 
doi:10.1371/journal.pone.0099314.g007 

transcriptional machinery for cRNA, causing higher expression of 
the corresponding cRNA. Nevertheless, we cannot exclude the 
possibility that cRNA expression causes the higher asRNA 
expression in the same manner. Additional studies are in need 
to test these hypotheses. 

Gene expression regulation by DNA modification 

Previous studies indicated that gene transcription and DNA 
methylation are closely interwoven processes [40] . In mammalian 
genomes, DNA methylation is generally associated with silent 
CpG island promoters. However, the majority of CpG island 
promoters remain methyl-free regardless of their expression status 
[41]. In human testicular cancer, the addition of methylation 
inhibitors leads to downregulation of ncRNA expression [42]. In 
plants, a spontaneous mutation causing a single nucleotide 
polymorphism (SNP) between wild-type and mutant genes alter 
the secondary structure of LDMAR, an IncRNA gene. This 
change increases methylation in the putative promoter region of 
LDMAR, which reduces LDMAR transcription specifically under 
long-day conditions [43] . 

Although the expression regulation of chloroplast genes has 
been intensively studied [10], the presence of ncRNA and DNA 
modifications and their interaction with the expression of cRNA 
remain unclear. In the current study, we use SMRT technology 
and detected possible base modifications in chloroplast genomes. 
Furthermore, we identified two motifs that are associated with 
base modifications. As the first motif is similar to the TATA box in 
sequence, we hypothesize tiiat it might regulate the expression of 
downstream genes as a transcriptional factor binding site. After 
classifying the downstream genes by gene type and their distance 
to the putative TATA box, the modification was found to be 
associated with significantiy higher expression of ncRNA but not 
those protein-coding and tRNA genes. One explanation is that 
proteins and tRNA-coding genes might be regulated by specific 
elements such as promoters and enhancers, and that base 
modification plays a limited role in their expression regulation. 
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However, ncRNA expression regulation is less specific, and the 
modification of its upstream TATA box-like motif leads to 
significant differences in expression. Contrary to the general 
thoughts that DNA methylation downregulate gene expression, we 
found that CPGDMMl methylation is correlated with increased 
ncRNA expression. These observations may reflect the unique 
mechanisms present in chloroplast genomes. Further investigation 
is needed to confirm these associations and to elucidate possible 
mechanisms. 

Chloroplast based genetic engineering 

Chloroplast vector systems have several advantages and 
consequently have multiple biotechnological applications [4]. 
One of our rationales for this study is to lay the foundation for 
designing efficient, chloroplast-based transformation vectors for S. 
miltiorrhiza. Chloroplast genomes are most abundant in the leaves. 
Therefore, focusing on the chloroplast genome characteristics in 
the leaves is reasonable. After obtaining a detailed expression map 
of S. miltiorrhiza chloroplasts in the leaves, we systematically predict 
regions that are potential insertion sites for foreign genes. Two of 
them have already been commonly used as sites for foreign gene 
insertion. Furthermore, the region between p.sah and psaQ on 
polycistron pc8 and that between ndhK and ndhd on polycistron 
pc9 (Figure 7) were also identified as the most preferred insertion 
sites. These two sites have not been used previously and represent 
novel insertion site for foreign genes. An extensive validation of 
these sites for their application in chloroplast-based genetic 
engineering research will be the subject of a separate study. 

Technical limitations 

Second- and third-generation DNA sequencing technologies 
have made feasible the detection of ncRNA and DNA modifica- 
tions at the genome scale. However, formidable experimental 
difficulties are inherent to antisense RNA and DNA modification 
research. First, most asRNA transcripts are expressed at signifi- 
candy low levels and are thus difficult to be validated using 
classical technologies such as northern blot analysis and in situ 
hybridization. Second, the intricate relationship between sense 
and antisense transcripts means that experimental perturbation of 
one transcript inevitably interferes with the expression of the other 
transcript. Consequently, determining the biological functions of 
the transcripts by knocking-in and knocking-out the transcript 
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